Entropy and information in neural spike trains: Progress on the sampling problem 
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The major problem in information theoretic analysis of neural responses and other biological 
data is the reliable estimation of entropy-like quantities from small samples. We apply a recently 
introduced Bayesian entropy estimator to synthetic data inspired by experiments, and to real exper- 
imental spike trains. The estimator performs admirably even very deep in the undersampled regime, 
where other techniques fail. This opens new possibilities for the information theoretic analysis of 
experiments, and may be of general interest as an example of learning from limited data. 
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I. INTRODUCTION 



There has been considerable progress in using infor- 
mation theoretic methods to sharpen and to answer 
many Questions about the structure of the neural code 
HSSI1I3 II1I1I1 Where classical experimental ap- 
proaches have focused on mean response of neurons to 
relatively simple stimuli, information theoretic methods 
have the power to quantify the responses to arbitrarily 
complex and even fully natural stimuli H llfjfl . taking 
account of both the mean response and its variability 
in a rigorous way, independent of detailed modeling as- 
sumptions. Measurements of entropy and information in 
spike trains also allow us to test directly the hypothesis 
that the neural code adapts to the distribution of sensory 
inputs, optimizi ng t he rate or efficiency of information 
transmission [lj El 13 III El 

A problem with such measurements is that entropy and 
information depend explicitly on the full distribution of 
neural responses, just a limited sample of which is pro- 
vided by experiments. In particular, we need to know 
the distribution of responses to each stimulus in our en- 
semble, and the number of samples from this distribution 
is limited by the number of times the full set of stimuli 
can be repeated. For natural stimuli with long corre- 
lation times the time required to present a useful "full 
set of stimuli" is long, limiting the number of indepen- 
dent samples we can obtain from stable neural recordings. 
Furthermore, natural stimuli generate neural responses of 
high timing precision, and thus th e sp ace of meaningful 
responses itself is very large y, [T(j, EH LUl • These factors 
make the sampling problem more serious as we move to 
more interesting and natural stimuli. 
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A natural response to this problem is to give up the 
generality of a completely model independent informa- 
tion theoretic approach. Some explicit help from models 
is required to regularize learning of the underlying proba- 
bility distributions from the experiments. The question is 
if we can keep the generality of our analysis by introduc- 
ing the gentlest of regularizations for the abstract learn- 
ing problem, or if we need stronger assumptions about 
the structure of the neural code itself (for example, in- 
troducing a metric on the space of responses 18, 19]). 

A classical problem suggests that we may succeed even 
with very weak assumptions. Remember that one needs 
to have only TV ~ 23 people in a room before any two 
of them are reasonably likely to share the same birthday. 
This is much less than K — 365, the number of pos- 
sible birthdays. Turning this around, we can estimate 
the number of possible birthdays by polling N people 
and counting how often we find coincidences. Once N is 
large enough to have observed a few of those, we can get 
a pretty good estimate of K . This will happen with a 
significant probability for N ~ \f~K <C K. 

The idea of estimating entropy by counting coinci- 
dences was proposed long ago by Ma [2(j for physical 
systems in the microcanonical ensemble where distribu- 
tions should be uniform at fixed energy. Clearly, if wc 
could generalize the Ma idea to arbitrary distributions, 
then we would be able to explore a much wider variety 
of question about information in the neural code. Here 
we argue that a simple and abstract Bayesian prior, in- 
troduced in Ref. [2]]], comes close to the objective. 

It is well known that, for N < K, there are no uni- 
versally good entropy estimators jH 111] • Thus the main 
question is: does a particular method work well only for 
(possibly irrelevant) abstract model problems, or can it 
also be trusted for natural data? Hence our goal is nei- 
ther to search for potential theoretical limitations of the 
approach (these must exist and have been found) , nor to 
analyze the neural code (this will be left for the future). 
Instead we aim at convincingly showing that the method 
of Ref. 0| can generate reliable estimates of entropy well 
into a classically undersampled regime for an experimen- 
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tally relevant case of neurophysiological recordings. 



II. AN ESTIMATION STRATEGY 

Consider the problem of estimating the entropy S of 
a probability distribution {px}, S — — J^i = iPilog 2 Pj, 
where the index i runs over K possibilities (e.g., K possi- 
ble neural responses). In an experiment we observe that 
in N examples each possibility i occurred ri\ times. If 
N K, we approximate the probabilities by frequen- 
cies, pi » fx = rix/N, and construct a naive estimate of 
the entropy, 



S„ 



K 

E 



/ilog 2 /i- 



(1) 



This is also a maximum likelihood estimator, since the 
maximum likelihood estimate of the probabilities is given 
by the frequencies. Thus we will replace S n aivc by S ML 
in what follows. 

It is well know that S ML underestimates the entropy 
(cf. Ref. [13). With good sampling (N >• K), classical 
arguments due to Miller [24[ show that the ML estimate 
should be corrected by a universal term (K— 1)/2N, and 
several groups have used this correction in the analysis of 
neural data. In practice, many bins may have truly zero 
probability (for example, as a result of refractoriness; see 
below) , and the samples from the distribution might not 
be completely independent. Then S Mh still deviates from 
the correct answer by a term oc 1/N, but the coefficient 
is no longer known a priori. Under these conditions one 
can heuristically verify and extrapolate the 1/N behav- 
ior from subsets of the available data Q. Alternatively, 
still agreeing on the 1/N correction, one can calculate its 
coefficient (interpretable as an effective number of bins 
K*) for some classes of distributions |25l 1261 127 | . All of 
these approaches, however, work only when the sampling 
errors are in some sense a small perturbation. 

If we want to make progress outside of the asymptot- 
ically large N regime we need an estimator that does 
not have a perturbative expansion in 1/N with Sml as 
the zeroth order term. The estimator of Ref. [21j has 
just this property. Recall that Sml is a limiting case of 
Bayesian estimation with Dirichlct priors. Formally, we 
consider that the probability distributions p = {px} are 
themselves drawn from a distribution Vpip) of the form 



i 



Z(P;K) 
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where the delta function enforces normalization of dis- 
tributions p and the partition function Z(J3\ K) normal- 
izes the prior Vp{p). Maximum likelihood estimation is 
Bayesian estimation with this prior in the limit (3 — > 0, 
while the natural "uniform" prior is /3 = 1. The key ob- 
servation of Ref. [2l| is that while these priors are quite 



smooth on the space of p, the distributions drawn at ran- 
dom from Vp all have very similar entropies, with a vari- 
ance that vanishes as K becomes large. Fundamentally, 
this is the origin of the sample size dependent bias in en- 
tropy estimation, and one might thus hope to correct the 
bias at its source. The goal then is to construct a prior 
on the space of probability distributions which generates 
a nearly uniform distribution of entropies. Because the 
entropy of distributions chosen from Vp is sharply de- 
fined and monotonically dependent on the parameter /3, 
we can come close to this goal by an average over (3, 

Pnsb(p)<x J d/3 dS{ ^ K) V p (p). (3) 

Here S((3; K) is the average entropy of distributions cho- 
sen from 

S{(3-K) = i = MKI3+l)-Md + ^), (4) 

where ip m (x) = (d/dx) m+1 log 2 r(a;) are the polygamma 
functions. 

Given this prior, we proceed in standard Bayesian fash- 
ion. The probability of observing the data n = {nx} given 
the distribution p is 
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P(n|p)ocJ]^S 
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and then 



P(p|n) = P(n|p)P NSB (p) 



(6) 
(7) 



P(n)' 

P(n) = J dpP(n|p)P NS B(p), 

-Yj>xlog 2 px\ P(p|n). (8) 

Here we need to calculate the first two posterior moments 
of the entropy, m = 1,2, in order to have an access to 
the entropy estimate and to its variance as well. 

The Dirichlet priors allow all the (K dimensional) in- 
tegrals over p to be done analytically, so that the compu- 
tation of S NSB and of its posterior error reduces to just 
three numerical one-dimensional integrals: 



S 



NSB\ 
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where 



r(/?(0) 



(9) 
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where the one-to-one relation between (3 and £ is given 
by Eq. 10} , and £?^*(n) is the expectation value of the 
m-th entropy moment at fixed /3; the exact expression 
for m = 1 , 2 is given in Ref. (2^| . 

Details of the NSB method can be found in Refs. pH 
29] , and the source code of the implementations in either 
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Octave/C++ or plain C++ is available from the authors. 
We draw attention to several points. 

First, since the analysis is Bayesian, we obtain not 
only S NSB but also its a posteriori standard deviation, 
— an error bar on our estimate, see Eq. ©. 

Second, for N — ► oo and N/K — > the estimator ad- 
mits asymptotic analysis. The important parameter is 
the number of coincidences A = N — K\, where K\ is the 
number of bins with non-zero counts. If A/N — > const < 
1 (many coincidences), then the standard saddle point 
evaluation of the integrals in Eq. is possible. Inter- 
estingly, the second derivative at the saddle is (In 2 2) A 
to the leading order in A/N. The second asymptotic can 
be obtained for A ~ O(N ) (few coincidences). Then 



-iNSB 



a 



In 2 



SS 



NSB 



l + 21og 2 7V-V>o(A) : 



(11) 
(12) 



where C 7 is the Euler's constant. This is particularly 
interesting since S NSB happens to have a finite limit for 
K — > co, thus allowing entropy estimation even for infi- 
nite (or unknown) cardinalities. 

Third, both of the above asymptotics show that the 
estimation procedure relies on A to make its estimates; 
this is in the spirit of Ref . [2(j . 

Finally, S* NSB is unbiased if the distribution being 
learned is typical in Vp(p) for some /3, that is, its rank 
ordered (Zipf) plot is of the form 



PB(J3, K0-0)(K-l)i~\ 1/(K/3 -« 



K 



/3B(/3,K/3- /3)(K + 
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for i/K — > and i/K — > 1 respectively. If the Zipf plot 
has tails that are too short (too long), then the estima- 
tor should over (under) estimate. While underestimation 
may be severe (though always strictly smaller than that 
for S Mh ), overestimation is very mild, if present at all, 
in the most interesting regime 1 C A < iV. gNSB j g 
also unbiased for distributions that are typical in some 
weighted combinations of Vp for different /?'s, in particu- 
lar in "Pnsb itself. However, the typical Zipf plots in this 
case are more complicated and will be detailed elsewhere. 

Before proceeding it is worth asking what we hope to 
accomplish. Any reasonable estimator will converge to 
the right answer in the limit of large N. In particular, 
this is true for (S NSB , which is a consistent Bayesian es- 
timator |35| . The central problem of entropy estimation 
is systematic bias, which will cause us to (perhaps signif- 
icantly) under- or overestimate the information content 
of spike trains or the efficiency of the neural code. The 
bias, which vanishes for N — > oo, will manifest itself as 
a systematic drift in plots of the estimated value versus 
the sample size. A successful estimator would remove 
this bias as much as possible. Ideally we thus hope to see 
an estimate which for all values of N is within its error 



bars from the correct answer. As N increases the error 
bars should narrow, with relatively little variation of the 
(mean) estimate itself. When data are such that no reli- 
able estimation is possible, the estimator should remain 
uncertain, that is, the posterior variance should be large. 
The main purpose of this paper is to show that the NSB 
procedure applied to natural and nature-inspired syn- 
thetic signals comes close to this ideal over a wide range 
of N <C K, and even N <C 2 s . The procedure thus is a 
viable tool for experimental analysis. 



III. A MODEL PROBLEM 

It is important to test our techniques on a problem 
which captures some aspects of real world data yet is 
sufficiently well defined that we know the correct an- 
swer. We constructed synthetic spike trains where in- 
tervals between successive spikes were independent and 
chosen from an exponential distribution with a dead time 
or refractory period of g = 1.8 ms; the mean spike rate 
was r = 0.26 spikes/ms. This corresponds to the rate 
of tq = r/(l — rg) = 0.49 spikes/ms for the part of the 
signal, where spiking is not prohibited by refractoriness. 
These parameters are typical of the high spike rate, noisy 
regions of the experiment discussed below, which provide 
the greatest challenge for entropy estimation. 

Following the scheme outlined in Ref. 0, we examine 
the spike train in windows of duration T = 15 ms and 
discretize the response with a time resolution r = 0.5 ms. 
Because of the refractory period each bin of size r can 
contain at most one spike, and hence the neural response 
is a binary word with T/t = 30 letters. The space of re- 
sponses has K — 2 30 ps 10 9 possibilities. Of course, most 
of these have probability exactly zero because of refrac- 
toriness, and the number of possible responses consistent 
with this constraint is bounded by ~ 2 16 w 10 5 . An ap- 
proximation to the entropy of this distribution, isgiven 
by an appropriate correction to Eq. (3.21) of Ref. [9j, the 
entropy of a non-refractory Poisson process: 



S = 



rT 
ln2 



-ln(l-e" 



r re 



-r T 



1 



-r T 



= 13.57 bits. 



(15) 

In Fig. n we show the results of entropy estimation 
for this model problem. As expected, the naive esti- 
mate S ML reaches its asymptotic behavior only when 
N > 2 s , thus the 1/N extrapolation becomes successful 
at N ~ 10 4 (the "ML fit" line on the plot). In contrast, 
we see that S B gives the right answer within errors at 
N ~ 100. We can improve convergence by providing the 
estimator with the "hint" that the number of possible 
responses K is much smaller than the upper limit of 2 30 , 
but even without this hint we have excellent entropy es- 
timates already at N ~ (2 s ) 1 / 2 . This is in accord with 
expectations from Ma's analysis of (microcanonical) en- 
tropy estimation |20| . However, here we achieve these 
results for a nonuniform distribution. 



4 



Refractory spikes, T= 15 ms, r = 0.5 ms 
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FIG. 1: Entropy estimation for a model problem. Notice that 
the estimator reaches the true value within the error bars as 
soon as TV 2 ~ 2 s , at which point coincidences start to occur 
with high probability. Slight overestimation for N > 10 3 is 
expected (see text) since this distribution is atypical in "Pnsb- 
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FIG. 3: Slice entropy vs. sample size 
plots is drawn at the value of S lNSB | JV _ JV 
the estimator is stable within its error bars even for very low 
TV. Triangle corresponds to the value of g ML extrapolated to 
N — *• oo from the four largest values of N. First and second 
panels show examples of word lengths for which Sml can or 
cannot be reliably extrapolated. S NSB is stable in both cases, 
shows no N dependent drift, and agrees with S ML where the 
latter is reliable. 



FIG. 2: Data from a fly motion sensitive neuron in a natural 
stimulus setting. Top: a 500 ms section of a 10 s angular 
velocity trace that was repeated 196 times. Bottom: raster 
plot showing the response to 30 consecutive trials; each dot 
marks the occurrence of a spike. 



IV. ANALYZING REAL DATA 

For a test on real neurophysiological data, we use 
recordings from a wide field motion sensitive neuron (HI) 
in the visual system of the blowfly Calliphora vicina. 
While action potentials from HI were recorded, the fly 
rotated on a stepper motor outside among the bushes, 
with time dependent angular velocity representative of 
natural flight. Figure |3 presents a sample of raw data 
from such an experiment (see Ref. 01 f° r details). 

Following Ref. 0, the information content of a spike 
train is the difference between its total entropy and the 
entropy of neural responses to repeated presentations 
of the same stimulus |3^. The latter is substantially 
more difficult to estimate. It is called the noise entropy 
S n , since it measures response variations that are un- 
corrected with the sensory input. The noise in neurons 



depends on the stimulus itself — there are, for example, 
stimuli which generate with certainty zero spikes in a 
given window of time — and so we write S n u to mark the 
dependence on the time t at which we take a slice through 
the raster of responses. In this experiment the full stimu- 
lus was repeated 196 times, which actually is a relatively 
large number by the standards of neurophysiology. The 
fly makes behavioral decisions based on ~ 10 — 30 ms 
windows of its visual input |30| . and under natural con- 
ditions the time resolution of the neural responses is of 
order 1 ms or even less jl() . so that a meaningful anal- 
ysis of neural responses must deal with binary words of 
length 10 — 30 or more. Refractoriness limits the number 
of these words which can occur with nonzero probabil- 
ity (as in our model problem), but nonetheless we easily 
reach the limit where the number of samples is substan- 
tially smaller than the number of possible responses. 

Let us start by looking at a single moment in time, 
t = 1800 ms from the start of the repeated stimulus, as 
in Fig. |2| If we consider a window of duration T = 16 ms 
at time resolution r = 2 ms |37j . we obtain the entropy 
estimates shown in the first panel of Fig. [31 Notice that 
in this case we actually have a total number of samples 
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N=75 
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5 ^ NSB 10 15 

FIG. 4: Distribution of the normalized entropy error condi- 
tional on 5 ,NSB (Af max ) for N — 75 and r = 0.75 ms. Darker 
patches correspond to higher probability. The band in the 
right part of the plot is the normal distribution around zero 
with the standard deviation of 1 (the standard deviation of 
plotted conditional distributions averaged over 5 NSB is about 
0.7, which indicates a non-Gaussian form of the posterior for 
small number of coincidences |2S^1. For values of g NSB U p to 
about 12 bits the estimator performs remarkably well. For 
yet larger entropies, where the number of coincidence is just 
a few, the discrete nature of the estimated values is evident, 
and this puts a bound on reliability of S NSB . 

which is comparable to or larger than 2 s ™i t , and so the 
maximum likelihood estimate of the entropy is converg- 
ing with the expected 1/N behavior. The NSB estimate 
agrees with this extrapolation. The crucial result is that 
the NSB estimate is correct within error bars across the 
whole range of N; there is a slight variation in the mean 
estimate, but the main effect as we add samples is that 
the error bars narrow around the correct answer. In this 
case our estimation procedure has removed essentially all 
of the sample size dependent bias. 

As we open our window to T = 30 ms, the number 
of possible responses (even considering refractoriness) is 
vastly larger than the number of samples. As we see in 
the second panel of Fig. |3 any attempt to extrapolate 
the ML estimate of entropy now requires some wishful 
thinking. Nonetheless, in parallel with our results for the 
model problem, we find that the NSB estimate is stable 
within error bars across the full range of available N. 

For small T we can compare the results of our Bayesian 
estimation with an extrapolation of the ML estimate; 
each moment in time relative to the repeated stimulus 
provides an example. We have found that the results in 
the first panel of Fig. [3] are typical: in the regime where 
extrapolation of the ML estimator is reliable, our estima- 
tor agrees within error bars over a broad range of sample 
sizes. More precisely, if we take the extrapolated ML 
estimate as the correct answer, and measure the devia- 
tion of S NSB from this answer in units of the predicted 
error bar, we find that the mean square value of this nor- 
malized error is of order one. This is as expected if our 



estimation errors are random rather than systematic. 

For larger T we do not have a calibration against the 
(extrapolated) S ML , but we can still ask if the estima- 
tor is stable, within error bars, over a wide range of 
N. To check this stability we treat the value of S NSB 
at N — 7V max — 196 as our best guess for the en- 
tropy and compute the normalized deviation of the es- 
timates at smaller values of N from this guess, e = 
[S NSB (/V) - S NSB (/V max )] /SS NSB (N). Again, each mo- 
ment in time is an example. Figure 01 shows the distri- 
bution of these normalized deviations conditional on the 
entropy estimate with N — 75; this analysis is done for 
t = 0.75 ms, with T in the range between 1.5 ms and 
22.5 ms. Since the different time slices span a range of 
entropies, over some range we have N > 2 s , and in this 
regime the entropy estimate must be accurate (as in the 
analysis of small T above). Throughout this range, the 
normalized deviations fall in a narrow band with mean 
close to zero and a variance of order one, as expected if 
the only variations with the sample size were random. 
Remarkably this pattern continues for larger entropies, 
S > log 2 N = 6.2 bits, demonstrating that our estimator 
is stable even deep into the undersampled regime. This 
is consistent with the results obtained in our model prob- 
lem, but here we find the same answer for the real data. 

Note that Fig. 0|illustrates results with TV less than one 
half the total number of samples, so we really are testing 
for stability over a large range in TV. This emphasizes 
that our estimation procedure moves smoothly from the 
well sampled into the undersampled regime without ac- 
cumulating any clear signs of systematic error. The pro- 
cedure collapses only when the entropy is so large that 
the probability of observing the same response more than 
once (a coincidence) becomes negligible. 



V. DISCUSSION 

The estimator we have explored here is constructed 
from a prior that has a nearly uniform distribution of 
entropies. It is plausible that such a uniform prior would 
largely remove the sample size dependent bias in entropy 
estimation, but it is crucial to test this experimentally. In 
particular, there are infinitely many priors which are ap- 
proximately (and even exactly) uniform in entropy, and 
it is not clear which of them will allow successful estima- 
tion in real world problems. We have found that the NSB 
prior almost completely removed the bias in the model 
problem (Fig.0. Further, for real data in a regime where 
undersampling can be beaten down by data the bias is 
removed to yield agreement with the extrapolated ML 
estimator even at very small sample sizes (Fig. [3J first 
panel). Finally and most crucially, the NSB estimation 
procedure continues to perform smoothly and stably past 
the nominal sampling limit of N ~ 2 s , all the way to the 
Ma cutoff N 2 ~ 2 s (Fig.EJ. This opens the opportunity 
for rigorous analysis of entropy and information in spike 
trains under a much wider set of experimental conditions. 
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[35] In reference to Bayesian estimators, consistency usually 
means that, as N grows, the posterior probability con- 
centrates around unknown parameters of the true model 
that generated the data. For finite parameter models, 
such as the one considered here, only technical assump- 
tions like positivity of the prior for all parameter values, 
soundness (different para meters always correspond to dif- 
ferent distributions) |31| , and a few others are needed for 
consistency. For nonparametric models, the situation is 
more complicated. There one also needs ultraviolet con- 
vergence of the functional integrals defined by the prior 

Eiall 

[36] It may happen that information is a small difference be- 
tween two large entropies. Then, due to statistical errors, 
methods that estimate information directly will have an 



advantage over NSB, which estimates entropies first. In 
our case, this is not a problem since the information is 
roughly a half of the total available entropy Q- 
[37] For our and many other neural systems, the spike tim- 
ing can be more accurate than the refractory period of 
roughly 2 ms 0, [ToL |34| . For the current amount of data, 
discretization of r <C 1 ms and large enough T will push 
the limits of all estimation methods, including ours, that 
do not make explicit assumptions about properties of the 
spike trains. Thus, to have enough statistics to convinc- 
ingly show validity of the NSB approach, in this paper we 
choose t — 0.75 ... 2 ms, which is still much shorter than 
other methods can handle. We leave open the possibility 
that more information is contained in timing precision at 
finer scales. 



